Multilingual Raw Speech Corpus
Dataset Description
97:43:54 hours | 62.2 GB speech data | 1,916 speakers | 1,916 audio segments | 48 kHz | 16-bit WAV.
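As a quick sanity check on downloaded files, the stated audio format (48 kHz, 16-bit WAV) can be verified with Python's standard `wave` module. This is a minimal sketch and not part of the LDC-IL distribution: the function name `check_wav_format` is hypothetical, and the channel count is only reported, not checked, because the description does not state it.

```python
import wave

# Format stated in the dataset description: 48 kHz, 16-bit WAV.
EXPECTED_RATE_HZ = 48_000
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16 bits = 2 bytes per sample


def check_wav_format(path):
    """Return (ok, details) for one PCM WAV file (hypothetical helper).

    `ok` is True when the sample rate and sample width match the values
    stated in the dataset description. The channel count is reported in
    `details` but not validated, since the description does not state it.
    """
    with wave.open(path, "rb") as wf:
        details = {
            "sample_rate_hz": wf.getframerate(),
            "sample_width_bytes": wf.getsampwidth(),
            "channels": wf.getnchannels(),
            "duration_s": wf.getnframes() / wf.getframerate(),
        }
    ok = (
        details["sample_rate_hz"] == EXPECTED_RATE_HZ
        and details["sample_width_bytes"] == EXPECTED_SAMPLE_WIDTH_BYTES
    )
    return ok, details
```

Calling `check_wav_format("some_file.wav")` on each file (the file name here is a placeholder) gives a simple pass/fail flag plus the measured values, which is usually enough to catch a resampled or truncated copy.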
The LDC-IL Multilingual Raw Speech Corpus is extracted from the raw speech corpora published by LDC-IL in various Indian languages. It is built to serve applications such as language identification modules, which require samples from multiple languages; to explore cross-linguistic variation and diatopic comparison in order to determine what generalizations are possible about the types of variable features; and to build multilingual phoneme sets and models.
The multilingual speech dataset is sampled from the content type ‘Creative Text-T2’, which is extracted mainly from literary sources. The creative text of the LDC-IL speech dataset comprises essays and short stories. One of these, selected at random from the set, is assigned to each speaker to read aloud. The same story may be read by multiple speakers.
Available speech corpus details:
Total Speakers: 1,916 (958 female and 958 male)
A detailed explanation of the Multilingual Raw Speech Corpus will be available in the Multilingual Raw Speech Documentation.
For research citations, please use the following:
- Narayan Kumar Choudhary, Rajesha N., Manasa G. 2021. Multilingual Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
- Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview”
Item specifics
- Authors: Narayan Kumar Choudhary, Rajesha N., Manasa G.
- Corpus Type: Raw Corpus
- Catalogue Number: 1281
- ISBN: 978-81-948885-3-6
- Data Source: On Field
- Duration: 97:43:54
- # of Audio Segments: 1916
- Release Date: 15-Jun-2021
- Terms and Conditions: General instructions for use of the resources provided by LDC-IL.
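The duration and segment count listed above imply an average segment length, which can help when planning storage or processing batches. The sketch below only does arithmetic on the two catalogue figures; the helper name `parse_hms` is hypothetical, and the result is a mean, not a statement about any individual recording.

```python
def parse_hms(text):
    """Convert an 'HH:MM:SS' string to a whole number of seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Figures copied from the catalogue entry.
total_seconds = parse_hms("97:43:54")
audio_segments = 1916

# Mean length of one audio segment, in seconds.
mean_segment_seconds = total_seconds / audio_segments
whole_minutes, remaining_seconds = divmod(mean_segment_seconds, 60)

print(total_seconds)                   # 351834
print(round(mean_segment_seconds, 1))  # 183.6
print(f"about {int(whole_minutes)} min {remaining_seconds:.0f} s per segment")
```

Since the entry lists the same number of speakers and audio segments, the same figure is also the average recorded time per speaker.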